Accelerating Large Scale Scientific Exploration through Data Diffusion

Authors

  • Ioan Raicu
  • Yong Zhao
  • Ian Foster
  • Alex Szalay
Abstract

Scientific and data-intensive applications often require exploratory analysis of large datasets. Such analysis is typically carried out on large-scale distributed resources, where data locality is crucial to achieving high system throughput and performance. We propose a “data diffusion” approach that acquires resources for data analysis dynamically, schedules computations as close to the data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and “cached” to allow faster response to subsequent requests; resources are released when demand drops. Depending on the application workloads and the performance characteristics of the underlying infrastructure, this approach can provide the benefits of dedicated hardware without the associated high costs. The data diffusion concept is reminiscent of cooperative Web caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The challenge for our approach is that storage resources must be co-allocated with computation resources to enable efficient analysis of possibly terabytes of data without prior knowledge of the characteristics of application workloads. To explore data diffusion, we have developed Falkon, which provides dynamic acquisition and release of resources and the dispatch of analysis tasks to those resources. We have extended Falkon to allow compute resources to cache data on local disks and to perform task dispatch via a data-aware scheduler. The integration of Falkon and the Swift parallel programming system gives us access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes.
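The data-aware dispatch idea described in the abstract can be illustrated with a minimal sketch. This is not Falkon's scheduler or API: the class names Executor and DataAwareScheduler, the LRU eviction policy, and the least-loaded tie-breaking rule are all assumptions made for illustration. The sketch also leaves out dynamic acquisition and release of resources and models only cache-aware task placement.

```python
from collections import OrderedDict


class Executor:
    """Hypothetical compute node with a small LRU cache of files on local disk."""

    def __init__(self, name, cache_slots):
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # file name -> True, least recently used first
        self.queued = 0             # number of tasks dispatched to this node

    def run(self, file_name):
        """Record one task on file_name; return True on a local cache hit."""
        self.queued += 1
        hit = file_name in self.cache
        if hit:
            self.cache.move_to_end(file_name)    # mark as most recently used
        else:
            if len(self.cache) >= self.cache_slots:
                self.cache.popitem(last=False)   # evict the least recently used file
            self.cache[file_name] = True         # stands in for a copy from shared storage
        return hit


class DataAwareScheduler:
    """Hypothetical dispatcher that prefers nodes already caching the input file."""

    def __init__(self, executors):
        self.executors = list(executors)

    def dispatch(self, file_name):
        """Send a task to a node holding file_name, else to the least-loaded node."""
        holders = [e for e in self.executors if file_name in e.cache]
        candidates = holders or self.executors
        target = min(candidates, key=lambda e: e.queued)
        hit = target.run(file_name)
        return target.name, hit
```

With two executors "a" and "b", each holding two files, dispatching tasks on f1, f2, and then f1 again places the third task on the node that already cached f1, so it is served from local disk instead of shared storage. A fuller model would also grow and shrink the executor pool as demand changes, which this sketch deliberately omits.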

Similar resources

Towards Data-Driven Large Scale Scientific Visualization and Exploration

Title of dissertation: TOWARDS DATA-DRIVEN LARGE-SCALE SCIENTIFIC VISUALIZATION AND EXPLORATION

LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Gray and Szalay [2] documented the data avalanche problem in the sciences in which improvements in physical instruments and better data pipelines lead to an exponential growth in data size. In Astronomy for example, the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) produces tens of terabytes daily [3]. Exploring the resulting, massive amounts of data is of immense scientific...

2- and 3-dimensional synthetic large-scale de novo patterning by mammalian cells through phase separation.

Synthetic biology provides an opportunity for the construction and exploration of alternative solutions to biological problems - solutions different from those chosen by natural life. To this end, synthetic biologists have built new sensory systems, cellular memories, and alternative genetic codes. There is a growing interest in applying synthetic approaches to multicellular systems, especially...

Foresight: Rapid Data Exploration Through Guideposts

Current tools for exploratory data analysis (EDA) require users to manually select data attributes, statistical computations and visual encodings. This can be daunting for large-scale, complex data. We introduce Foresight, a visualization recommender system that helps the user rapidly explore large high-dimensional datasets through “guideposts.” A guidepost is a visualization corresponding to a...

Visualization in Radiation Oncology: Towards Replacing the Laboratory Notebook

Data exploration in radiation oncology requires the creation of a large number of visualizations. For treatment planning, detailed information about the processes used to manipulate data collected and to create visualizations is needed for assessing the quality of the results. Current visualization systems allow the interactive creation and manipulation of complex visualizations. However, they ...

Publication date: 2007